hypergeometric distribution

Confidence sequences for sampling without replacement

Ian Waudby-Smith

Neural Information Processing Systems

We present a generic approach to constructing a frequentist confidence sequence (CS) using Bayesian tools, based on the fact that the ratio of the prior to the posterior at the ground truth is a martingale. We then present Hoeffding- and empirical-Bernstein-type time-uniform CSs and fixed-time confidence intervals for sampling without replacement (WoR), which improve on previous bounds in the literature and explicitly quantify the benefit of WoR sampling.
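
As a toy illustration of the prior-to-posterior ratio idea, the sketch below simulates the martingale for a conjugate Beta-Bernoulli model under i.i.d. sampling; the model, prior parameters (`a0`, `b0`), horizon `T`, and thresholds are illustrative assumptions, not the paper's without-replacement construction.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
p_true, a0, b0, T, reps = 0.3, 2.0, 2.0, 50, 20_000

x = rng.random((reps, T)) < p_true                    # Bernoulli(p_true) data
heads = x.sum(axis=1)
prior = beta.pdf(p_true, a0, b0)                      # prior density at the truth
post = beta.pdf(p_true, a0 + heads, b0 + T - heads)   # conjugate posterior at the truth
ratio = prior / post                                  # nonnegative martingale at time T

# The ratio has expectation 1 under the true parameter, and by Ville's
# inequality it exceeds 1/alpha with probability at most alpha; inverting
# this event over all times yields a time-uniform confidence sequence.
print(ratio.mean())                # close to 1
print((ratio > 1 / 0.05).mean())   # at most about 0.05
```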


On Design Choices in Similarity-Preserving Sparse Randomized Embeddings

Kleyko, Denis, Rachkovskij, Dmitri A.

arXiv.org Artificial Intelligence

Expand & Sparsify is a principle observed in anatomically similar neural circuits found in the mushroom body (insects) and the cerebellum (mammals). Sensory data are projected randomly to a much higher-dimensional space (expand part), where only a few of the most strongly excited neurons remain active (sparsify part). This principle has been leveraged to design the FlyHash algorithm, which forms similarity-preserving sparse embeddings that have been found useful for tasks such as novelty detection, pattern recognition, and similarity search. Despite its simplicity, FlyHash has a number of design choices to be set, such as preprocessing of the input data, the choice of sparsifying activation function, and the formation of the random projection matrix. In this paper, we explore the effect of these choices on the performance of similarity search with FlyHash embeddings. We find that the right combination of design choices can lead to drastic differences in search performance.
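
To make the expand-and-sparsify recipe concrete, here is a minimal NumPy sketch in the spirit of FlyHash; the expansion dimension, projection density, and number of winners `k` are illustrative assumptions rather than the settings studied in the paper.

```python
import numpy as np

def flyhash(X, expand_dim=2048, k=32, density=0.1, seed=0):
    """Sketch of an expand-and-sparsify embedding in the spirit of FlyHash.

    X: (n_samples, d) input data.
    expand_dim: dimensionality of the random expansion (>> d).
    k: number of winners kept per sample (sparsify step).
    density: fraction of nonzero entries in the random projection.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Expand: sparse binary random projection to a much higher dimension.
    M = (rng.random((d, expand_dim)) < density).astype(float)
    Y = X @ M
    # Sparsify: winner-take-all, keep only the k most strongly excited units.
    H = np.zeros_like(Y)
    top_k = np.argpartition(Y, -k, axis=1)[:, -k:]
    np.put_along_axis(H, top_k, 1.0, axis=1)
    return H

# Example: every embedding has exactly k active units.
X = np.random.default_rng(1).random((5, 64))
print(flyhash(X).sum(axis=1))
```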


Theoretical Proportion Label Perturbation for Learning from Label Proportions in Large Bags

Kubo, Shunsuke, Matsuo, Shinnosuke, Suehiro, Daiki, Terada, Kazuhiro, Ito, Hiroaki, Yoshizawa, Akihiko, Bise, Ryoma

arXiv.org Artificial Intelligence

Learning from label proportions (LLP) is a form of weakly supervised learning that trains an instance-level classifier from the label proportions of bags, which are sets of instances without instance-level labels. A challenge in LLP arises when the number of instances in a bag (the bag size) is large, making traditional LLP methods impractical due to GPU memory limitations. This study aims to develop an LLP method capable of learning from bags of large size. In our method, smaller bags (mini-bags) are generated by sampling instances from the large original bags, and these mini-bags are used in place of the original bags. However, the proportion of a mini-bag is unknown and differs from that of the original bag, leading to overfitting. To address this issue, we propose a perturbation method for the proportion labels of sampled mini-bags to mitigate overfitting to noisy label proportions. The perturbation is drawn from the multivariate hypergeometric distribution, which statistically models the sampling process. Additionally, loss weighting is implemented to reduce the negative impact of proportions sampled from the tail of the distribution. Experimental results demonstrate that the proportion label perturbation and loss weighting achieve classification accuracy comparable to that obtained without sampling. Our code is available at https://github.com/stainlessnight/LLP-LargeBags.
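
The perturbation idea can be illustrated with a small sketch (not the authors' released implementation) that draws a mini-bag's class counts from the multivariate hypergeometric distribution implied by sampling without replacement from the original bag; the helper name and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturbed_minibag_proportion(bag_proportion, bag_size, minibag_size):
    """Perturb a mini-bag's proportion label via the multivariate
    hypergeometric distribution (sampling without replacement).

    bag_proportion: class proportions of the original (large) bag.
    bag_size: number of instances in the original bag.
    minibag_size: number of instances sampled into the mini-bag.
    """
    # Approximate per-class counts in the original bag.
    counts = np.round(np.asarray(bag_proportion) * bag_size).astype(int)
    # Class counts of a mini-bag drawn without replacement follow a
    # multivariate hypergeometric distribution.
    drawn = rng.multivariate_hypergeometric(counts, minibag_size)
    return drawn / minibag_size

print(perturbed_minibag_proportion([0.7, 0.2, 0.1], bag_size=10_000, minibag_size=64))
```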


On Efficient and Statistical Quality Estimation for Data Annotation

Klie, Jan-Christoph, Haladjian, Juan, Kirchner, Marc, Nair, Rahul

arXiv.org Artificial Intelligence

Annotated datasets are an essential ingredient for training, evaluating, comparing, and productionizing supervised machine learning models. It is therefore imperative that annotations are of high quality. For their creation, good quality management and thereby reliable quality estimates are needed. Then, if quality is insufficient during the annotation process, rectifying measures can be taken to improve it. Quality estimation is often performed by having experts manually label instances as correct or incorrect, but checking all annotated instances tends to be expensive. Therefore, in practice, usually only subsets are inspected; their sizes are chosen mostly without justification or regard to statistical power and, more often than not, are relatively small. Basing estimates on small sample sizes, however, can lead to imprecise values for the error rate. Using unnecessarily large sample sizes costs money that could be better spent, for instance on more annotations. Therefore, we first describe in detail how to use confidence intervals to find the minimal sample size needed to estimate the annotation error rate. Then, we propose applying acceptance sampling as an alternative to error rate estimation. We show that acceptance sampling can reduce the required sample sizes by up to 50% while providing the same statistical guarantees.
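
As a rough illustration of the sample-size question, the sketch below searches for the smallest sample whose normal-approximation confidence interval, with a finite population correction for sampling without replacement, meets a target half-width; the formula and defaults are textbook choices assumed here for illustration, not necessarily the procedure recommended in the paper.

```python
import math

def minimal_sample_size(population, margin, p=0.5, z=1.96):
    """Smallest sample size whose approximate confidence interval for an
    error rate has half-width <= `margin`, using a finite population
    correction. p=0.5 is the worst case for the variance; z=1.96 gives
    roughly 95% confidence. These defaults are illustrative.
    """
    for n in range(1, population + 1):
        fpc = (population - n) / (population - 1) if population > 1 else 0.0
        half_width = z * math.sqrt(p * (1 - p) / n * fpc)
        if half_width <= margin:
            return n
    return population

# e.g. estimating the error rate of 10,000 annotations to within +/- 5%
print(minimal_sample_size(population=10_000, margin=0.05))
```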


Estimating Unknown Population Sizes Using the Hypergeometric Distribution

Hodgson, Liam, Bzdok, Danilo

arXiv.org Machine Learning

The multivariate hypergeometric distribution describes sampling without replacement from a discrete population of elements divided into multiple categories. Addressing a gap in the literature, we tackle the challenge of estimating discrete distributions when both the total population size and the sizes of its constituent categories are unknown. Here, we propose a novel solution using the hypergeometric likelihood to solve this estimation challenge, even in the presence of severe under-sampling. We develop our approach to account for a data generating process where the ground truth is a mixture of distributions conditional on a continuous latent variable, such as in collaborative filtering, using the variational autoencoder framework. Simulation experiments demonstrate that our method outperforms other likelihood functions used to model count data, both in the accuracy of the population size estimate and in its ability to learn an informative latent space. We demonstrate our method's versatility through applications in NLP, by inferring and estimating the complexity of latent vocabularies in text excerpts, and in biology, by accurately recovering the true number of gene transcripts from sparse single-cell genomics data.
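
For intuition, the following sketch evaluates the multivariate hypergeometric log-likelihood that the abstract proposes to maximize over unknown category sizes; the VAE machinery of the paper is not reproduced, and the function name and toy numbers are assumptions.

```python
import numpy as np
from scipy.special import gammaln

def mvhg_log_likelihood(counts, category_sizes):
    """Log-likelihood of observed counts under a multivariate
    hypergeometric model with hypothesised category population sizes.

    counts: observed draws per category (sampling without replacement).
    category_sizes: candidate population size of each category.
    """
    counts = np.asarray(counts)
    K = np.asarray(category_sizes, dtype=float)
    N, n = K.sum(), counts.sum()

    def log_comb(a, b):
        # log of the binomial coefficient C(a, b) via log-gamma
        return gammaln(a + 1) - gammaln(b + 1) - gammaln(a - b + 1)

    return log_comb(K, counts).sum() - log_comb(N, n)

# Toy example: score two candidate population configurations.
obs = [30, 15, 5]
print(mvhg_log_likelihood(obs, [300, 150, 50]))   # proportions match the sample
print(mvhg_log_likelihood(obs, [150, 150, 200]))  # mismatched proportions score lower
```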


Overdrawing Urns using Categories of Signed Probabilities

Jacobs, Bart, Stein, Dario

arXiv.org Artificial Intelligence

For drawing (multiple) coloured balls from a statistical urn, we distinguish three well-known modes: 1. hypergeometric or draw-and-delete: drawing a ball from the urn without replacement, so that the urn shrinks; 2. multinomial or draw-and-replace: drawing with replacement, so that the urn remains the same; 3. Pólya or draw-and-duplicate: drawing a ball from the urn and replacing it together with an additional ball of the same colour, so that the urn grows. Multinomial and Pólya draws may be of arbitrary size, but hypergeometric draws are limited in size by the number of balls in the urn. In this paper we lift this limitation and allow hypergeometric draws of arbitrary size, including 'overdraws' containing more balls than are in the urn. Physically this is strange, but, as we will show, mathematically it makes sense once we allow negative probabilities. Negative probabilities have emerged in quantum physics (e.g. in double-slit experiments) and have been discussed in the work of famous physicists such as Wigner, Dirac, and Feynman.
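
The three draw modes can be illustrated with a direct simulation (which, unlike the paper's categorical treatment, cannot express overdraws or negative probabilities); the urn representation and function name below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw(urn, k, mode):
    """Draw k balls from an urn given as {colour: count}, under one of the
    three modes described above."""
    urn = dict(urn)
    drawn = []
    for _ in range(k):
        colours = list(urn)
        counts = np.array([urn[c] for c in colours], dtype=float)
        c = str(rng.choice(colours, p=counts / counts.sum()))
        drawn.append(c)
        if mode == "hypergeometric":   # draw-and-delete: urn shrinks
            urn[c] -= 1
        elif mode == "polya":          # draw-and-duplicate: urn grows
            urn[c] += 1
        # "multinomial" (draw-and-replace): urn unchanged
    return drawn

urn = {"red": 3, "blue": 2}
for mode in ["hypergeometric", "multinomial", "polya"]:
    print(mode, draw(urn, 4, mode))
```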


Differentiable Random Partition Models

Sutter, Thomas M., Ryser, Alain, Liebeskind, Joram, Vogt, Julia E.

arXiv.org Artificial Intelligence

Partitioning a set of elements into an unknown number of mutually exclusive subsets is essential in many machine learning problems. However, assigning elements, such as samples in a dataset or neurons in a network layer, to an unknown and discrete number of subsets is inherently non-differentiable, prohibiting end-to-end gradient-based optimization of parameters. We overcome this limitation by proposing a novel two-step method for inferring partitions, which allows its usage in variational inference tasks. This new approach enables reparameterized gradients with respect to the parameters of the new random partition model. Our method works by first inferring the number of elements per subset and, second, by filling these subsets in a learned order. We highlight the versatility of our general-purpose approach on three different challenging experiments: variational clustering, inference of shared and independent generative factors under weak supervision, and multitask learning.


Learning Group Importance using the Differentiable Hypergeometric Distribution

Sutter, Thomas M., Manduchi, Laura, Ryser, Alain, Vogt, Julia E.

arXiv.org Artificial Intelligence

Partitioning a set of elements into subsets of a priori unknown sizes is essential in many applications. These subset sizes are rarely explicitly learned, be it the cluster sizes in clustering applications or the number of shared versus independent generative latent factors in weakly-supervised learning. Probability distributions over correct combinations of subset sizes are non-differentiable due to hard constraints, which prohibit gradient-based optimization. In this work, we propose the differentiable hypergeometric distribution. The hypergeometric distribution models the probability of different group sizes based on their relative importance. We introduce reparameterizable gradients to learn the importance between groups and highlight the advantage of explicitly learning the size of subsets in two typical applications: weakly-supervised learning and clustering. In both applications, we outperform previous approaches, which rely on suboptimal heuristics to model the unknown size of groups.

Many machine learning approaches rely on differentiable sampling procedures, of which the reparameterization trick for Gaussian distributions is the best known (Kingma & Welling, 2014; Rezende et al., 2014). The non-differentiable nature of discrete distributions has long hindered their use in machine learning pipelines with end-to-end gradient-based optimization. Only the concrete distribution (Maddison et al., 2017) or Gumbel-Softmax trick (Jang et al., 2016) boosted the use of categorical distributions in stochastic networks. Unlike the high-variance gradients of score-based methods such as REINFORCE (Williams, 1992), these works enable reparameterized and low-variance gradients with respect to the categorical weights. Despite enormous progress in recent years, the extension to more complex probability distributions is still missing or comes with a trade-off regarding differentiability or computational speed (Huijben et al., 2021). The hypergeometric distribution plays a vital role in various areas of science, such as social and computer science and biology. The range of applications goes from modeling gene mutations and recommender systems to analyzing social networks (Becchetti et al., 2011; Casiraghi et al., 2016; Lodato et al., 2015). The hypergeometric distribution describes sampling without replacement and therefore models the number of samples per group given a limited number of total samples. Hence, it is essential wherever the choice of a single group element influences the probability of the remaining elements being drawn. Previous work mainly uses the hypergeometric distribution implicitly to model assumptions or as a tool to prove theorems.
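
For a flavour of reparameterized sampling without replacement, the sketch below uses an iterated Gumbel-Softmax with soft masking in PyTorch; this is a common relaxation assumed here only for illustration, not the noncentral-hypergeometric construction proposed in the paper, and the function name and temperature are hypothetical.

```python
import torch
import torch.nn.functional as F

def relaxed_sample_wor(logits, k, tau=0.5):
    """Differentiable relaxation of drawing k items without replacement,
    via iterated Gumbel-Softmax with soft masking of already-drawn items."""
    mask = torch.zeros_like(logits)
    samples = []
    for _ in range(k):
        u = torch.rand_like(logits).clamp_min(1e-9)
        g = -torch.log(-torch.log(u))                    # Gumbel noise
        y = F.softmax((logits + g + mask) / tau, dim=-1)  # soft one-hot draw
        samples.append(y)
        # Softly remove the selected item so it is unlikely to be drawn again.
        mask = mask + torch.log1p(-y.clamp(max=1 - 1e-6))
    return torch.stack(samples)

logits = torch.randn(6, requires_grad=True)
draws = relaxed_sample_wor(logits, k=3)
draws.sum().backward()                      # gradients flow back to the logits
print(draws.shape, logits.grad.shape)
```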